Main goals:
(I will be walking through the code for illustrative purposes, but I can't teach you how to program in 20 minutes!)
# Import basic functions
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from copy import deepcopy
pd.set_option('display.max_columns', 50)
This is a useful, publicly available dataset for demonstrating some common data science techniques (data source). We'll build some toy examples here, but the methods and principles generalize easily to other datasets.
# Load in raw profiles
dating_data = pd.read_csv("./dating_data/profiles_sample.csv", index_col=0)
dating_data.head()
dating_data.shape
In business contexts: similar methods can use somebody's profile on your website to predict whether they would be interested in your product.
# Let's use just these features to try to predict a person's age
# (I'm excluding variables like "kids", which might be dead giveaways.)
prof_cols = ['body_type', 'diet', 'drinks', 'drugs', 'education', 'location', 'job', 'orientation', 'sex', 'smokes', 'speaks']
dating_data[prof_cols].head()
Question: How do we get a computer to "understand" a person's dating profile?
Answer: Math! (matrices, linear algebra).
# Most columns are "categorical"
# e.g., for whether or not someone drinks alcohol, they
# can choose from among the following categories:
dating_data.drinks.unique()
# To convert this data into a matrix, we will take each
# category and convert it into a binary column:
dating_data.drinks.str.get_dummies().head(n=20)
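A toy illustration of what `str.get_dummies` produces, using a hypothetical stand-in for the "drinks" column (made-up values, not the real dataset):

```python
import pandas as pd

# Hypothetical toy Series standing in for the "drinks" column
drinks = pd.Series(["socially", "often", "not at all", "socially"])

# One binary column per category, columns sorted alphabetically
dummies = drinks.str.get_dummies()
print(list(dummies.columns))   # ['not at all', 'often', 'socially']
print(dummies.loc[0].tolist()) # [0, 0, 1] -- row 0 was "socially"
```

Each row still sums to 1, since every person falls in exactly one category; this is what turns a categorical column into something a model can multiply by coefficients.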
# Note: data is often very messy
# Lots of work in data science is just cleaning/processing data
# Example:
dating_data.pets.unique()
# I've done the processing work ahead of time for
# the rest of the columns in the dataset
# Load in pre-processed data:
profile_features = pd.read_csv("./dating_data/profile_features.csv", index_col=0)
profile_features.head(n=10)
# How to define outcome variable (age)?
age = dating_data.age
age.head()
_ = plt.hist(age)
_ = plt.title("Distribution of ages in dataset")
# In most applications, you probably don't need super
# fine precision, i.e., someone's exact age
# Here, we will "discretize" age into a categorical variable:
# Binary definition; i.e., "is 30 yrs old or younger"
age_30 = (age <= 30)
age_30.head()
# Categorical definition:
# Define bin boundaries
bins = [0,20,30,40,50,100]
# Use pd.cut function to bin the data
category = pd.cut(age, bins)
age_bins = category.apply(lambda x: str(x))
age_bins.head()
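A small self-contained sketch of how `pd.cut` assigns values to bins (toy ages, not the real data):

```python
import pandas as pd

ages = pd.Series([18, 25, 30, 42, 67])
bins = [0, 20, 30, 40, 50, 100]

# Intervals are right-closed by default: 30 falls in (20, 30], not (30, 40]
binned = pd.cut(ages, bins).astype(str)
print(binned.tolist())
# ['(0, 20]', '(20, 30]', '(20, 30]', '(40, 50]', '(50, 100]']
```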
# Building a basic logistic regression classifier
# using profile features to predict age
from sklearn.linear_model import LogisticRegression
age_logit = LogisticRegression()
age_logit.fit(profile_features, age_30)
logit_predictions = pd.DataFrame({
    "prediction": age_logit.predict(profile_features),
    "ground_truth": age_30
})
logit_predictions['correct'] = (logit_predictions.prediction == logit_predictions.ground_truth)
logit_predictions.head(n=10)
# We usually think of "True" as 1 and "False" as 0
logit_predictions.astype(int).head()
# Evaluate overall accuracy:
logit_accuracy = logit_predictions.correct.mean()
print("Logistic regression accuracy: {:.2f}%".format(logit_accuracy*100))
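Accuracy here is nothing more than the mean of a boolean "was this prediction correct?" vector; a tiny hand-computed example (made-up values):

```python
predictions  = [True, False, True, True]
ground_truth = [True, True,  True, False]

# Element-wise correctness, then its mean = fraction correct
correct = [p == g for p, g in zip(predictions, ground_truth)]
accuracy = sum(correct) / len(correct)
print("Accuracy: {:.2f}%".format(accuracy * 100))  # Accuracy: 50.00%
```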
We'll try making the same prediction, using different machine learning models:
# Logistic regression
from sklearn.linear_model import LogisticRegression
age_logit = LogisticRegression()
age_logit.fit(profile_features, age_30)
round((age_logit.predict(profile_features)==age_30).mean()*100, 2)
# Decision Tree
from sklearn.tree import DecisionTreeClassifier
age_dt = DecisionTreeClassifier(max_depth=15, min_samples_leaf=5)
age_dt.fit(profile_features, age_30)
round((age_dt.predict(profile_features)==age_30).mean()*100, 2)
# Random forest
from sklearn.ensemble import RandomForestClassifier
age_rf = RandomForestClassifier(n_estimators=100, max_depth=20, min_samples_leaf=5)
age_rf.fit(profile_features, age_30)
round((age_rf.predict(profile_features)==age_30).mean()*100, 2)
If you know what cross-validation is, this is just a short demonstration on how to compare the various models using out-of-sample, cross-validated accuracy measures.
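For readers new to the idea, here is a minimal pure-Python sketch of the k-fold splitting that `cross_validate` performs internally (illustration only; sklearn additionally handles shuffling, stratification, model refitting, and scoring):

```python
# Split n_samples indices into k folds; each fold takes one turn
# as the held-out test set while the rest form the training set.
def kfold_indices(n_samples, k):
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:end])
        start = end
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

splits = list(kfold_indices(10, 5))
print(len(splits))        # 5 train/test splits
print(splits[0][1])       # first held-out fold: [0, 1]
print(len(splits[0][0]))  # 8 training samples per split
```

Averaging the test-set score over all five splits is what the `.mean()` calls below report.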
from sklearn.model_selection import cross_validate
scoring = {
    "accuracy": "accuracy",
    "precision": "precision",
    "recall": "recall",
    "f1": "f1_macro"
}
logit_clf = LogisticRegression()
scoring_obj = cross_validate(logit_clf, profile_features, age_30, scoring=scoring, cv=5, return_train_score=False)
for sc in scoring.keys():
    print("{: >10}: {:.3f}".format(sc, scoring_obj["test_"+sc].mean()))
dt_clf = DecisionTreeClassifier(max_depth=15, min_samples_leaf=5)
scoring_obj = cross_validate(dt_clf, profile_features, age_30, scoring=scoring, cv=5, return_train_score=False)
for sc in scoring.keys():
    print("{: >10}: {:.3f}".format(sc, scoring_obj["test_"+sc].mean()))
rf_clf = RandomForestClassifier(n_estimators=100, max_depth=20, min_samples_leaf=5)
scoring_obj = cross_validate(rf_clf, profile_features, age_30, scoring=scoring, cv=5, return_train_score=False)
for sc in scoring.keys():
    print("{: >10}: {:.3f}".format(sc, scoring_obj["test_"+sc].mean()))
How can we improve performance? One idea: use text inputs from user profiles.
dating_data[[c for c in dating_data.columns if c.startswith("essay")]].head()
Working with text is messy, and training vector models can take a long time, so I've done essentially all of the hard work ahead of time. The result is loaded below:
text_features = pd.read_csv("./dating_data/text_features.csv", index_col=0)
text_features.head()
# Using embedding of text data to predict age:
age_logit = LogisticRegression()
age_logit.fit(text_features, age_30)
(age_logit.predict(text_features)==age_30).mean()
# What happens if we combine the profile characteristics and text features?
combined_features = np.hstack((text_features.values, profile_features.values))
age_logit = LogisticRegression()
age_logit.fit(combined_features, age_30)
(age_logit.predict(combined_features)==age_30).mean()
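What `np.hstack` does here is simply place the two feature matrices side by side, row-aligned. A toy-shape sketch (hypothetical numbers; only the shapes matter):

```python
import numpy as np

# Toy shapes: 2 users, 3 text features, 2 profile features
text = np.array([[0.1, 0.2, 0.3],
                 [0.4, 0.5, 0.6]])
profile = np.array([[1, 0],
                    [0, 1]])

# Rows stay aligned per user; columns are concatenated
combined = np.hstack((text, profile))
print(combined.shape)  # (2, 5)
```

Both inputs must have the same number of rows, i.e., the same users in the same order.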
# What about using fancy methods with fancy word embeddings?
age_rf = RandomForestClassifier(n_estimators=50, max_depth=40, min_samples_leaf=10)
age_rf.fit(text_features, age_30)
(age_rf.predict(text_features)==age_30).mean()
# BE WARY! This is "in-sample" fit; predictions on "out-of-sample"
# data are actually no better than logistic regression in this case
logit_clf = LogisticRegression()
scoring_obj = cross_validate(logit_clf, text_features, age_30, scoring=scoring, cv=5, return_train_score=False)
for sc in scoring.keys():
    print("{: >10}: {:.3f}".format(sc, scoring_obj["test_"+sc].mean()))
rf_clf = RandomForestClassifier(n_estimators=100, max_depth=40, min_samples_leaf=5)
scoring_obj = cross_validate(rf_clf, text_features, age_30, scoring=scoring, cv=5, return_train_score=False)
for sc in scoring.keys():
    print("{: >10}: {:.3f}".format(sc, scoring_obj["test_"+sc].mean()))
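An exaggerated, hypothetical sketch of why in-sample fit can be misleading: a "model" that memorizes its training labels scores perfectly in-sample, yet is only at chance level on unseen data.

```python
import random

random.seed(0)

# Training data the "model" memorizes (labels are random coin flips)
train_labels = {i: random.randint(0, 1) for i in range(100)}

def predict(x):
    if x in train_labels:
        return train_labels[x]    # perfect recall on training data
    return random.randint(0, 1)   # coin flip on anything unseen

in_sample = sum(predict(i) == train_labels[i]
                for i in train_labels) / len(train_labels)

test_labels = {i: random.randint(0, 1) for i in range(100, 1100)}
out_of_sample = sum(predict(i) == test_labels[i]
                    for i in test_labels) / len(test_labels)

print(in_sample)       # 1.0 -- looks like a perfect model
print(out_of_sample)   # close to 0.5 -- chance level
```

Real overfitting is less extreme, but the same mechanism is why the cross-validated scores above, not the in-sample accuracies, are the numbers to compare.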
import pkg_resources
import types
def get_imports():
    for name, val in globals().items():
        if isinstance(val, types.ModuleType):
            # Split ensures you get the root package,
            # not just an imported function
            name = val.__name__.split(".")[0]
        elif isinstance(val, type):
            name = val.__module__.split(".")[0]
        # Some packages are weird and have different
        # imported names vs. system/pip names. Unfortunately,
        # there is no systematic way to get pip names from
        # a package's imported name. You'll have to add
        # exceptions to this list manually!
        poorly_named_packages = {
            "PIL": "Pillow",
            "sklearn": "scikit-learn"
        }
        if name in poorly_named_packages.keys():
            name = poorly_named_packages[name]
        yield name
imports = list(set(get_imports()))
# The only way I found to get the version of the root package
# from only the name of the package is to cross-check the names
# of installed packages vs. imported packages
requirements = []
for m in pkg_resources.working_set:
    if m.project_name in imports and m.project_name != "pip":
        requirements.append((m.project_name, m.version))
for r in requirements:
    print("{}=={}".format(*r))